-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve map(), struct(), array() #8666
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. Can you add a few tests in the appropriate ibis/tests/expr/test_{dtype}.py
files?
Also curious - what was the use case that led to you wanting this change?
in my first attempt at #8649, I went about it the wrong way, but I needed to ensure something was an Array, but the incoming thing might already have been an array. |
b6d983a
to
a9ca09b
Compare
OK, as I added those tests I found some problems, and this snowballed into also being a fix for #8289. The implementations for both of these things are somewhat intertwined. It would be a pain to fix one, and then just change it again to fixx the other. So I just did everything at once. |
a9ca09b
to
654bcda
Compare
9207925
to
b83016c
Compare
2fcf850
to
283fd6b
Compare
283fd6b
to
530fad0
Compare
My goodness this gotten a bit out of hand. EDIT: actually in the latest version it's not so bad. I didn't need to overhaul the internal representation after all. OK I just pushed an absolute doozy of a commit in 530fad0. Curious how many backends that fails on. It required really overhauling the internal representation of ops.Array and ops.Struct to be able to distinguish between
|
530fad0
to
55497bb
Compare
Hi @NickCrews, checking in here. What's the status on this one? It looks like there are a lot of conflicts and red on CI 😅 . |
Yeeesssss this needs a bit of love :) Once my other pr with .cases lands I figured I would pick this up again. |
55497bb
to
a00c26d
Compare
a00c26d
to
32e7636
Compare
2b7d36e
to
4d5c6e0
Compare
25c3e7b
to
b8dd34d
Compare
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. One this is adding support for passing in None to all these constructors. These use the new `ibis.null(<type>)` API to return `op.Literal(None, <type>)`s Make these constructors idempotent: you can pass in existing Expressions into array(), etc. The type argument for all of these now always has an effect, not just when passing in python literals. So basically it acts like a cast. A big structural change is that now ops.Array has an optional attribute "dtype", so if you pass in a 0-length sequence of values the op still knows what dtype it is. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.
b8dd34d
to
b9a6271
Compare
def visit_Array(self, op, *, exprs): | ||
return self.f.array(*exprs) | ||
def visit_Array(self, op, *, exprs, dtype): | ||
return self.cast(self.f.array(*exprs), dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems extraordinarily heavy-handed and overly broad.
Do we really need to cast the entire on constructing an array, for all SQL backends?
This PR is still too big. Can you split the array, map and struct implementations into three separate PRs? |
Yeah splitting it probably is good.
|
@NickCrews Just checking in here, can you split these up some more? |
OK, I just tried to split the struct and the map stuff into their own PRs, but both of those rely on the new |
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666
Picking out the array stuff from ibis-project#8666 Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones. By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction, we don't have to do any fancy casting during compilation, all the elements will already have been casted as needed. This allows for the compilation of array<structs> on some sql backends like postgres. If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`, which postgres freaks about. Instead, if we cast each individual element, such as `[cast({...} as ROW..., cast({...} as ROW...]`, then this is valid SQL. I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time, and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly, so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.
fixes #8289
This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.
Changes:
This is adding support for passing in None to all these constructors.
These use the new
ibis.null(<type>)
API to returnop.Literal(None, <type>)
sMake these constructors idempotent: you can pass in existing Expressions into array(), etc.
The type argument for all of these now always has an effect, not just when passing in python literals. So basically it acts like a cast.
A big structural change is that now ops.Array has an optional
attribute "dtype", so if you pass in a 0-length sequence
of values the op still knows what dtype it is.
Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup.
You can test this locally with eg
pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py
Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars
Next steps
A possible nice follow-up would be to completely remove supporting structs, arrays, and maps inside ops.Literal.
Currently, when someone calls
ibis.array([1,2])
, that now results in ops.Array, but if they doibis.literal([1,2])
, it is represented as ops.Literal. It would be nice if inside ibis.literal() there was some logic likeand then all the backends wouldn't have to even worry about supporting these nested types in
visit_Literal()